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Abstract 



The structure of a Bayesian network includes a great deal of information about 
the probability distribution of the data, which is uniquely identified given some 
^^D general distributional assumptions. Therefore it's important to study its vari- 

^_^ ability, which can be used to compare the performance of different learning 

fTl algorithms and to measure the strength of any arbitrary subset of arcs. 

^H In this paper we will introduce some descriptive statistics and the corre- 

^5 sponding parametric and Monte Carlo tests on the undirected graph underlying 

the structure of a Bayesian network, modeled as a multivariate Bernoulli random 
variable. A simple numeric example and the comparison of the performance of 
some structure learning algorithm on small samples will then illustrate their 
use. 

(N 

^ Keywords: Bayesian network, bootstrap, multivariate Bernoulli distribution, 

^^ structure learning algorithm. 

^H 

(N 

1. Introduction 

in 

^^ In recent years Bayesian networks have been successfully applied in several 

different disciplines, including medicine, biology and epidemiology (see for exam- 
ple Friedman et al. [T] and Holmes and Jain [2]). This has been made possible by 
the rapid evolution of structure learning algorithms, from constraint-based ones 
(such as PC 15, Grow-Shrink [4], IAMB [^ and its variants [6]) to score-based 
;_! (such as TABU search [7j, Greedy Equivalent Search [Sj and genetic algorithms 

^ [5]) and hybrid ones (such as Max-Min Hill Climbing [lOj). 

The main goal in the development of these algorithms has been the reduc- 
tion of the number of either independence tests or score comparisons needed to 
learn the structure of the Bayesian network. Their correctness has been proved 
assuming either very large sample sizes in relation to the number of variables 
(in the case of Greed Equivalent Search) or the absence of both false positives 
and false negatives (in the case of Grow-Shrink and IAMB). In most cases the 
characteristics of the learned networks were studied using a small number of 
reference data sets [11] as benchmarks, and differences from the true structure 
measured with purely descriptive measures such as Hamming distance |12j . 
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This approach to model evaluation is not possible for real world data sets, 
as the true structure of their probability distribution is not known. An alter- 
native is provided by the use of either parametric or nonparanietric bootstrap 
|13j . By applying a learning algorithm to a sufficiently large number of boot- 
strap samples it is possible to obtain the empirical probability of any feature 
of the resulting network [T3] , such as the structure of the Markov Blanket of a 
particular node. The fundamental limit in the interpretation of the results is 
that the "reasonable" level of confidence for thresholding depends on the data. 

In this paper we propose a modified bootstrap-based approach for the infer- 
ence on the structure of a Bayesian network. The undirected graph underlying 
the network structure is modeled as a multivariate Bernoulli random variable 
in which each component is associated with an arc. This assumption allows 
the derivation of both exact and asymptotic measures of the variability of the 
network structure or any of its parts. 

2. Bayesian networks and bootstrap 

Bayesian networks are graphical models where nodes represent random vari- 
ables (the two terms are used interchangeably in this article) and arcs represent 
probabilistic dependencies between them fT5]. 

The graphical structure Q = (V, A) of a Bayesian network is a directed 
acyclic graph (DAG) which defines a factorization of the joint probability distri- 
bution of V = {^1, X2, ■ ■ ■ , Xy}, often called the global probability distribution, 
into a set of local probability distributions, one for each variable. The form of 
the factorization is given by the Markov property of Bayesian networks, which 
states that every random variable Xi directly depends only on its parents Hxi ■ 

Therefore it is important to define confidence and variability measures for 
specific features in the network structure, such as the presence of specific config- 
urations of arcs. In particular a measure of variability for the network structure 
as a whole has many applications both as an indicator of goodness of fit for a 
particular Bayesian network and as a criterion to evaluate the performance of a 
learning algorithm. 

Confidence measures have been developed by Friedman et al. [U] using boot- 
strap simulation, and later modified by Imoto et al. (TB] to estimate the marginal 
confidence in the presence of an arc (called edge intensity, and also known as 
arc strength) and its direction. This approach can be summarized as follows: 

1. For 6= 1,2, ...,TO 

(a) re-sample a new data set DJj from the original data D using either 
parametric or nonparanietric bootstrap. 

(b) learn a Bayesian network Qt from D^. 

2. Estimate the confidence in each feature / of interest as P(/) = (1/m) J2T=i fiSb)- 

However, the empirical probabilities P(/) are difficult to evaluate, because 
the distribution of G in the space of DAGs is unknown and because the confi- 
dence threshold value depends on the data. 
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3. The multivariate Bernoulli distribution 

Let Bi,B2, ■ ■ ■ ,Bk, k e N be Bernoulli random variables with marginal 
probability of success pi,P2, ■ ■ ■ ,Pk, that is Bi ~ Ber(pi), i — 1, . . . , fc. Then 
the distribution of the random vector B = [Bi,B2, ■ ■ ■ ,Bk]^ over the joint 
probability space oi Bi, B2, ■ ■ ■ , Bk is a multivariate Bernoulli random variable 
[17j . denoted as -Berfe(p). Its probability function is uniquely identified by 
the parameter collection p = {pj :/C{l,...,fc},/^0}, which represents the 
dependence structure among the marginal distributions in tcrnrs of simultaneous 
successes for every non-empty subset / of elements of B. 

However, several useful results depend only on the first and second order 
moments of B 

(1) 

p1 (2) 

== Pi] - PiPj (3) 

and the reduced parameter collection p = {pij : i, j — I, . . . , k}, which can be 
used as an approximation of p in the generation random multivariate Bernoulli 
vectors in Krummenauer fl8|. 

3.1. Uncorrelation and independence 

We will first consider a simple result that links covariance and independence 
of two univariate Bernoulli variables. 

Theorem 1. Let Bi and Bj be two Bernoulli random variables. Then Bi and 
Bj are independent if and only if they are uncorrelated. 

Proof. If Bi and Bj are independent, then by definition CO\/{Bi, Bj) — 0. If on 
the other hand we have that COV(_Bi, Bj) = 0, then pij = piPj which completes 
the proof. D 

This theorem can be extended to multivariate Bernoulli random variables as 
follows. 

Theorem 2. Let B = [^1,^2, . . . , B^]^ and C = [Ci, C2, . . . , Cif , k,l eN be 
two multivariate Bernoulli random variables. Then B and C are independent 
if and only i/COV(B, C) = O, where O is the zero matrix. 

Proof. If B is independent from C, then by definition every pair {Bi,Cj), i = 
1, . . . ,k, j = 1, . . . ,1 is independent. Therefore COV(B, C) = [cy] = O. 

If conversely COV(B, C) — O, every pair {Bi, Cj) is independent as Cij = 
implies pij = piPj . This in turn implies the independence of the random vectors 
B and C, as their sigma-algebras cr(B) = cr(^i) x . . . x a{B}S) and cr(C) = 
(t(Ci) X ... X a{Ci) are functions of the sigma-algebras induced by the two sets 
of independent random variables Bi, B2, . . . , Bk and Ci, C2, . . . , C/. D 



The correspondence between uncorrelation and independence is identical 
to the analogous property of the multivariate Gaussian distribution (TH], and 
is closely related to the strong normality defined for orthogonal second order 
random variables in Loeve [20]. It can also be applied to disjoint subsets of 
components of a single multivariate Bernoulli variable, which are also distributed 
as multivariate Bernoulli random variables. 

Theorem 3. Let B = [Bi, B2, ■ ■ ■ , B^]'^ be a multivariate Bernoulli random 
variable; then every random vector B* = [i?ij, Bi^, . . . , -Bi,]^, {ii, 12, ... , ii} Q 
{1, 2, . . . , fc} is a multivariate Bernoulli random variable. 

Proof. The marginal components of B* are Bernoulli random variables, be- 
cause B is multivariate Bernoulli. The new dependency structure is defined 
as p* = {pi- : I* (= {ii, . . ■ , ii}, I* 7^ 0}, and uniquely identifies the probability 
distribution of B* . D 

3.2. Properties of the covariance matrix 

The covariance matrix S = [<7ij], i,j — \, . . . ,k associated with a multivari- 
ate Bernoulli random vector has several interesting numerical properties. Due 
to the form of the central second order moments defined in formulas [2] and [3] 
the diagonal elements an are bounded in the interval [0, 1/4]. The maximum is 
attained for pi = 1/2, and the minimum for both pi = and pi = 1. For the 
Cauchy-Schwarz theorem \19\ then \aij\ £ [0, 1/4]. 

The eigenvalues Ai, A2, . . . , Afc of S are similarly bounded, as shown in the 
following theorem. 

Theorem 4. Let B = [Bi, B2, ■ . ■ , Bk]^ be a multivariate Bernoulli random 
variable, and let E = ['^ij]? hj = l,...,fc be its covariance matrix. Let Xi, 
i = 1, . . . , A: be the eigenvalues ofT,. Then ^ X)i=i ^i ^ ^/^ o.i^d, ^ Ai ^ fc/4. 

Proof. Since S is a real, symmetric, non-negative definite matrix, the eigenval- 
ues Xi are non-negative real numbers [21j : this proves the lower bound in both 
inequalities. 

The upper bound in the first inequality holds because 

^ Aj = ^ an ^ max ^ o-ji = ^ max a^ ^ -, (4) 

i=l 1=1 ^'^"' 1=1 4=1 

and this in turn implies Xi ^ J2i=i ^i ^ ^/^i which completes the proof. D 
These bounds define a convex set in M.'', defined by the family 

fc" 



% 



V= <A''-\c) :ce 
where A''~'^{c) is the non-standard fc — 1 simplex 



(5) 



k 



A'=-i(c)=<^(Ai,...,Afc)GM'=:^A, = c,A, ^0 . (6) 



3.3. Sequences of multivariate Bernoulli variables 

Consider now a sequence of independent and identically distributed multi- 
variate Bernoulli variables Bi, B2, . . . , Bm ~ Berk{p). The sum 

m 

S,n = ^B,^Bik{m,p) (7) 

is distributed as a multivariate Binomial random variable |17j . thus preserving 
one of the fundamental properties of the univariate Bernoulli distribution. A 
similar result holds for the law of small numbers, whose multivariate version 
states that a fc-variate Binomial distribution Bik{m,p) converges to a multi- 
variate Poisson distribution Pk{A): 

Sm 4 Pfc(A) as mp^-A. (8) 

Both these distributions' probability functions, while tractable, are not very 
useful as a basis for closed-form inference procedures. An alternative is given by 
the asymptotic multivariate Gaussian distribution defined by the multivariate 
central limit theorem |19| : 

^™^"^'^^4a^.(0,I]). (9) 



The limiting distribution is guaranteed to exist for all possible values of p, as 
the first two moments are bounded and therefore are always finite. 

4. Inference on the network structure 

Let U = (V,E) be the undirected graph underlying a DAG G = (V,^), 
defined as its unique bioricntation "22: . Each edge e G E oi U corresponds to 
the directed arcs in A with the same incident nodes, and has only two possible 
states (it's either present in or absent from the graph). 

Then each possible edge e^, i = 1, . . . , |V|(|V| — l)/2 is naturally distributed 
as a Bernoulli random variable 

{Ci e £' with probability pi 
• ■ ■ O-^j 

Ci ^ E with probability 1 — Pi 

and every set 14^ C V x V (including E) is distributed as a multivariate 
Bernoulli random variable W and identified by the parameter collection pw = 
{Pw : w CW,w y^ 0}. The elements of pw can be estimated via parametric 
or nonparametric bootstrap as in Friedman et al. |14) . because they are func- 
tions of the DAGs Gb, b — 1, . . . , m through the underlying undirected graphs 
Uf, — {V,Eii). The resulting empirical probabilities 



i 
P 



rJ2h^^E,}m, (11) 



m 
6=1 



in particular 

P^ = :^^J2h<^'eE,}il^b) and p,j = —J2he.€E,^e,€E,}{l^b), (12) 

h=l fc=l 

can be used to obtain several descriptive measures and test statistics for the 
variability of the structure of a Bayesian network. 

4.1. Interpretation of bootstrapped networks 

Considering the undirected graphs Ui, . . . , lAm instead of the corresponding 
directed graphs ^i, . . . ^Qm greatly simplifies the interpretation of bootstrap's 
results. In particular the variability of the graphical structure can be summa- 
rized in three cases according to the entropy 23J of the set of the bootstrapped 
networks: 

• minimum entropy: all the networks learned from the bootstrap samples 
have the same structure, that is £^1 = £^2 = • ■ • = Em = E. This is the best 
possible outcome of the simulation, because there is no variability in the 
estimated network. In this case the first two moments of the multivariate 
Bernoulli distribution are equal to 

Pi^ I and E = O. (13) 

I otherwise 

• intermediate entropy: several network structures are observed with dif- 
ferent frequencies ?7if,, ^ m;, = m. The first two sample moments of the 
multivariate Bernoulli distribution are equal to 

Pi ^ — y^ rub and pij = — y^ mi,. (14) 

m ^-^ m ^-^ 

b-.eiGEb b:eieEi,,ejeEi 

• maximum entropy: all 2l^l'^l^l^^"^ possible network structures appear 
with the same frequency, that is 

^^l"J - 2|V|(|V|-l)/2 l-l,...,2 . (,iOJ 

This is the worst possible outcome because edges vary independently of 
each other and each one is present in only half of the networks (proof 



provided in AppendixB ) : 



Pi = - and S = -h- (16) 

This is also the only case in which all eigenvalues reach their maximum, 
that is Ai = A2 = . . . = Afe = 1/4. 



4-8. Descriptive statistics of network variability 

Several functions have been proposed in literature as univariate measures 
of spread of a multivariate distribution, usually under the assumption of mul- 
tivariate normality (see for example Muirhead |24j and Bilodeau and Brenner 
|25l). Three of them in particular can be used as descriptive statistics for the 
multivariate Bernoulli distribution: 

• the generalized variance, \/ARg{'E) — det(I]). 

• the total variance, \/ARt{T,) = tr(I]), also called total variation in Mardia 
et al. 1^. 



the squared Frobenius matrix norm, VARjv(S) = |||S — (/c/4)/j. 



12 

If- 



Both generalized and total variance associate high values of the statistic 
to unstable network structures, and are bounded due to the properties of the 
covariance matrix S. For the total variance it's easy to show that 

k 
Os^VART(S) = ^a,, s^-fc. (17) 

1=1 

The generalized variance is similarly bounded due to Hadamard's theorem on 
the determinant of a non- negative definite matrix [27j : 

0<VARG(E)<[]a,, < (j) . (18) 



x4, 

i=l ^ '' 

They reach the respective maxima in the maximum entropy case and are equal 
to zero only in the minimum entropy case. The generalized variance is also 
strictly convex (the maximum is reached only for E — (1/4)/^), but it is equal 
to zero if T, is rank deficient. For this reason it's convenient to reduce S to 
a smaller, full rank matrix (let's say E*) and compute VARg(S*) instead of 
VARg(S). 

The squared difference in Frobenius norm between S and k times the maxi- 
mum entropy covariance matrix associates high values of the statistic to stable 
network structures. It can be rewritten in terms of the eigenvalues Ai, . . . , A^ 
of E as 

VAR^(E)=^(a,-^) . (19) 

It has a unique maximum (in the minimum entropy case), which can be com- 
puted as the solution of the constrained minimization problem in A = [Ai, . . . , A^]"'" 

min/(A) = -^( A, --j subject to A, > 0, ^ A, ^ - (20) 



using Lagrange multipliers [55] • It also has a single minimum in A* = [1/4, . . . , 1/4], 
which is the projection of [k/A, . . . , fc/4] onto the set V and coincides with the 
maximum entropy case. The proof for these boundaries and the rationale behind 



the use of {k/4)Ik instead of (1/4)//; are reported in AppendixA 
The corresponding normalized statistics are: 

_ VARt(S) _ 4VARr(S) 
^^^"-^^^ - maxsVART(S) " k ^^^^ 

VARg(S) = ,,^p ;^, = 4'=VARg(S) (22) 

maxs VARg(2j) 

^ maxs VARAr(S) ~ VARa,(S]) ^ k^ - 16VARjv(S) 

^^ '^ maxsVARA,(E)-minsVARw(I]) fc(2fc - 1) ' ^ ' 

All of them vary in the [0, 1] interval and associate high values of the statistic to 
networks whose structure display a high variability across the bootstrap sam- 
ples. Equivalently we can define their complements VARt(S), VARg(I]) and 

VARAr(I]), which associate high values of the statistic to networks with little 
variability and can be used as measures of distance from the maxim,um, entropy 
case. 

^.3. Asymptotic inference 

The limiting distribution of the descriptive statistics defined above can be 
derived by replacing the covariance matrix E with its unbiased estimator E and 
by considering the multivariate Gaussian distribution from equation [91 The 
hypothesis we are interested in is 

Ho:ll=hk ffi : S ^ \h, (24) 

which relates the sample covariance matrix with the one from the maximum 
entropy case. 

For the total variance we have that tx — 4r7itr(I]) ~ Xmfc [21] j and since 
the maximum value of tr(I]) is achieved in the maximum entropy case, the 
hypothesis in Equation |24] assumes the form 

Ho : tr(S) = ^ i/i : tr(E) < ^. (25) 

Then the observed significance value is ar = P(iT ^ ^t **)' ^^^ "^^^ be improved 
with the finite sample correction 

5^ ^P{tr^ tT I tT e [0, mk]) = pf^^^^'^^j (26) 

rytT ^ mk) 



which accounts for the bounds on VARt(E) from inequality 17 

For the generalized variance there are several possible asymptotic and ap- 
proximate distributions: 



• the Gaussian distribution defined in Anderson 

' det(S) 



ici = V"i 



det(i/fe) 



1 



N{0,2k). 



(27) 



• the Gamma distribution defined in Steyn 
to 



mfc J det(S) . fk{m + l-k) 

2 \ det ah) \ 2 



det(|4 



(28) 



• the saddlepoint approximation defined in Butler et al. |31j . 
As before the hypothesis in Equation [24] assumes the form 



Ho : det(E) = det ( -h 



Hi : det(E) < det ( -h 



(29) 



The observed significance values for the Gaussian and Gamma distributions are 
aci — P{tGi ^ ^g") ^^^ ^G2 = P(^G2 ^ ^g")' ^^"^ t^^'5 respective finite sample 
corrections for the bounds on det(S]) are 



aGi = P (tGi s^ icT I tG, e [-V^, 0]) 



P(^G, ^ t^r) - P(^G, SC -V^) 



5g2 = P iG2 S^ to ' I iG. € 



0, 



TTlfc 



P{tG, ^0)- P{tG^ ^ -V^) 

pjtG, ^ to:) 

p(iG. s; f) ■ 



(30) 

(31) 



The test statistic associated with the squared Frobenius norm is the test for 
the equality of two covariance matrices defined in Nagao |32] , 



m 



because 



tr 



^l\h 



4S-/t 



tr 



4S - h 



Xife(fc+i)' (32) 



16E ^ 



16|||S--4|||2, 



(33) 



See AppendixA for an explanation of the use of (1/4)/^ instead of {k/A)Ik- The 
significance value for t^ is a^ — P{tN ^ ^n^) ^s the hypothesis in Equation 24 
becomes 



ffn 



|S--4|||f = 



7Ji:|||I]--/fe|||F>0. 



(34) 



Unlike the previous statistics, Nagao's test displays a good convergence speed, 
to the point that the finite sample correction for the bounds on the squared 
Frobenius matrix norm 

a.N -yKtN^tjq |iGie[U,i^ \) - — — „„^. (ci5j 

is not appreciably better than the raw significance value (see Table^for a simple 
example) . 

Jf..^. Monte Carlo inference and parametric bootstrap 

Another approach to compute the significance values associated with VARt(S), 
VARg(E) and VARjv(S) is applying again parametric bootstrap. 

The multivariate Bernoulli distribution Wq specified by the hypothesis in 



24 has a diagonal covariance matrix, so its components Wq., i — l,...,k 
are uncorrelated. According to Theorem [T] they are also independent, so the 
joint distribution of Wq is completely specified by the marginal distributions 
Wq. ^ Ber{l/2). Therefore it's possible (and indeed quite easy) to generate ob- 
servations from the null distribution and use them to estimate the significance 
value of the normalized statistics VAR7-(I]), VARg(S) and VARjv(S) defined in 
section 14.21 



1. compute the value of test statistic T on the original covariance matrix E. 

2. For r = 1,2, ...,i?. 

(a) generate m sets of k random samples from a Ber{\/2) distribution. 

(b) compute their covariance matrix E*. 

(c) compute T* from E* 

3. compute the Monte Carlo significance value as ur ~ (l/i?) X!r=i '^{x'^T}{Tr , 

This approach has two important advantages over the parametric tests de- 
fined in section llTsl 



• the test statistic is evaluated against its true null distribution instead of its 
asymptotic approximation, thus removing any distortion caused by lack of 
convergence (which can be quite slow and problematic in high dimensions). 

• each simulation r has a lower computational cost than the equivalent appli- 
cation of the structure learning algorithm to a bootstrap sample h. There- 
fore the Monte Carlo test can achieve a good precision with a smaller 
number of bootstrapped networks, allowing its application to larger prob- 
lems. 



5. A simple example 

Consider the multivariate Bernoulli distributions Wi, W2 and W3 with 
second order moments 
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Figure 1: The covariance matrices Ei, E2 and S3 represented as functions of their eigenval- 
ues in D (grey). The points (0,0) and (1/4,1/4) correspond to the minimum entropy and 
maximum entropy cases. 



associated with two (increasingly correlated) arcs from networks. The eigenval- 
ues of El, E2 and E3 are 



Ai- 



0.28 
0.20 



A2 = 



0.2121 
0.095 



and 



A, = 



0.3069 
0.0003 



(37) 



The values of the generalized variance, total variance and squared Frobenius 
matrix norm (both normalized and in the original scale) for the three covariance 
matrices are reported int Table [T] 

The corresponding asymptotic and Monte Carlo significance values are re- 
ported in Table [2] and [3] respectively. Each one has been computed for various 
hypothetical sample sizes {m = 10,20,50,100,200). Parametric bootstrap has 
been performed on i? = 10^ covariance matrices generated from the null distri- 
bution for each configuration of test statistic and sample size. 





VARt(E) 


VARg(E) 


VARw(E) 


VARt(E) 


VARg(E) 


VARw(E) 


El 
E2 
S3 


0.48 

0.3072 

0.3072 


0.056 
0.02016 
8.96 X 10-5 


0.1384 
0.2468 
0.2869 


0.96 

0.6144 

0.6144 


0.896 

0.32256 

0.00143 


0.9642 
0.6752 
0.5682 



Table 1: Original and normalized values of VARy, VARq and VARjv for Si, S2 and S3. 
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iT(S) 




10 


20 


50 


100 


200 


Si 


0.491137 


0.457610 


0.405404 


0.354943 


0.291243 


0.906041 


0.863836 


0.781414 


0.691495 


0.571734 


S2 


0.094193 


0.026330 


0.000852 


0.000003 


0.000000 


0.173766 


0.049704 


0.001644 


0.000007 


0.000000 


S3 


0.094193 


0.026330 


0.000852 


0.000003 


0.000000 


0.173766 


0.049704 


0.001644 


0.000007 


0.000000 


taj^) 


Si 


0.603944 


0.524258 


0.423183 


0.341131 


0.250054 


0.905218 


0.847522 


0.735799 


0.616696 


0.465129 


S2 


0.121488 


0.023514 


0.000278 


0.000000 


0.000000 


0.182091 


0.0380138 


0.000484 


0.000000 


0.000000 


S3 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


tAr(S) 


Si 


0.965205 


0.909123 


0.714937 


0.436839 


0.142271 


0.964547 


0.909108 


0.714937 


0.436839 


0.142271 


S2 


0.564938 


0.253762 


0.017090 


0.000142 


0.000000 


0.556708 


0.253636 


0.017090 


0.000142 


0.000000 


S3 


0.154551 


0.014796 


0.000008 


0.000000 


0.000000 


0.138557 


0.014628 


0.000008 


0.000000 


0.000000 



Table 2; Asymptotic significance values of tx, tQ^ and tjv for ^i, 
computed with the finite sample corrections are reported in bold. 



E2 and E3; the ones 



VARt(S) 




10 


20 50 


100 


200 


Si 
S2 
S3 


0.569655 
0.016834 
0.016834 


0.457109 0.129242 
0.000205 
0.000205 


0.017416 






0.000334 






VARg(S) 


Si 
S2 
S3 


0.784102 
0.063548 
0.005909 


0.512839 0.14788 
0.000761 
0.000008 


0.013678 






0.000094 






VARa,(I]) 


Si 
S2 
S3 


0.743797 
0.196996 
0.018292 


0.568819 0.239397 
0.037772 0.001018 
0.000355 


0.096544 
0.000005 



0.019633 







Table 3; Bootstrap significance values of VARy, VARq and VARjv from parametric bootstrap 
for El, S2 and E3. 
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6. Comparing independence tests and structure learning algorithms 

We will now illustrate how these tests can be used to compare different 
structure learning strategies, i.e. different combinations of structure learning 
algorithms, conditional independence tests and network scores. The impact of 
different choices for each component on the variability of the model can easily 
be assessed while keeping the other ones fixed. 

First we will compare the performance of the Grow-Shrink algorithm for 
three different conditional independence tests. The learning algorithm has been 
applied to samples of size 680, 685, 690, 695, 700, 705 and 710 (20 for each size) 
generated from the ALARM reference network |33j, which is composed by 37 
discrete nodes and 46 arcs for a total of 509 parameters. Both the data and the 
software implementation of the algorithm are included in the bnlearn package 
[34] for R [35]. The following tests have been considered: 

• the asymptotic x^ test based on mutual information [23] , which is in fact 
a log-likelihood ratio test and is also called the G^ test [36) . 

• the shrinkage estimator for the mutual information, which is a James-Stein 
regularized estimator developed by Hausser and Strimmer [37] . 

• Pearson's x^ asymptotic test for independence ^36, . 

The same threshold a — 0.05 for type I error has been used in three cases, and 
network variability has been assessed with the Monte Carlo test for the squared 
Frobenius norm. 

Results are shown in Figure [2] All the tests considered in the analysis 
start producing relatively stable network structures - i.e. the null hypothesis 



5 0.6 




680 685 690 695 700 705 710 680 685 690 695 700 705 710 680 685 690 695 700 705 710 

sample size 



Figure 2: Significance values for three different conditional independence tests (asymptotic and 
shrinkage estimators of mutual information and Pearson's x^) used with the same structure 
learning algorithm (Grow-Shrink). 
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corresponding to the maximum entropy case is rejected - at sample sizes 695 
and 700. Pearson's x^ test performs slightly better than mutual information, as 
documented in Agresti |36j when dealing with sparse contingency tables. This is 
also true for the shrinkage estimator. However, the difference among the three 
sets of significance values is very small. 

On the other hand we will now compare three different learning algorithms: 

• TABU search (which is a score-based algorithm), combined with a Bayesian 
Information criterion (BIG) score. 

• Grow-Shrink (which is a constraint-based algorithm) combined with the 
asymptotic x^ test based on mutual information described above and a = 
0.05. 

• Max-Min Hill Glimbing (which is hybrid algorithm), combined with a BIG 
score and the asymptotic mutual information test. 

As can be seen in Figure [3] in this case differences are more pronounced. The 
Max-Min Hill Glimbing algorithm, which is one of the top performers up to 
date for large networks, displays less variability than TABU search and Grow- 
Shrink at the same sample size. In particular the difference between Max-Min 
Hill Climbing and Grow-Shrink confirms the analysis made in Tsamardinos 
et al. [inj and the well-documented [3 instability displayed by constraint-based 
algorithms at small sample sizes. 




670675680685690695700705710 670675680685690695700705710 670675680685690695700705710 

sample size 



Figure 3: Significance values for three different structure learning algorithms (Grow-Shrink, 
TABU search and Max-Min Hill Climbing) using the same conditional independence tests and 
network scores (asymptotic mutual information test and the Bayesian Information Criterion 
(BIC), respectively). 
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7. Conclusions 

In this paper we derived the properties of several measures of variabihty for 
the structure of a Bayesian network through its underlying undirected graph, 
which is assumed to have a multivariate Bernoulli distribution. Descriptive 
statistics, asymptotic and Monte Carlo tests were developed along with their 
fundamental properties. They can be used to compare the performance of dif- 
ferent learning algorithms and to measure the strength of arbitrary subsets of 
arcs. 
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Appendix 

Appendix A. Bounds on the squared Probenius matrix norm 

The squared Frobenius matrix norm of the difference between the covariance 
matrix E and the maximum entropy matrix (1/4)//;; is 



2 

I „ 1 — ^ / I > 



j4iiil = E^-i)'- (A-i) 



i=l ^ 

Its unique global minimum is zero for S = (1/4)/^ but it has a varying number 
of global maxima depending on the dimension fc of S. They are the solutions 
of the constrained minimization problem 

min/(A) = -Ef A, --j subject to A, >0,EA, s^-. (A.2) 

This configuration of stationary points is not a problem for asymptotic and 
Monte Carlo tests, but prevents any direct interpretation of the values of de- 
scriptive statistics. 

On the other hand, the difference in squared Frobenius norm 

VAR^.(E) = |||E - ^41111, = E f^^ - j] (A.3) 
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Figure A. 4: Squared Frobenius matrix norms from (l/4)/x (on the left) and (fc/4)7fe (on the 
right) in 15 for k = 2. The green area is the set "D of the possible eigenvalues of S and the red 
lines are level curves. 



has both a unique global minimum (because it's a strictly convex function) 



minVARAr(S) = VARat ( -h 
V V4 



EG 



k 

4 4 



fc(fc-l)^ 
16 



(A.4) 



and a unique global maximum 



maxVARA,(E) = VAR7v(0) = ^ 



fc3 
16 



(A.5) 



which correspond to the minimum entropy (A — [0, ...,0]) and the m,axi- 
m,um entropy (A = [1/4, . . . , 1/4]) covariance matrices respectively (see figure 



A.4). However since {k/A)Ik is not a valid covariance matrix for a multivari- 
ate Bernoulli distribution, VARjv(E) cannot be used to derive any probabilistic 
result. 



AppendixB. Multivariate Bernoulli and the maximum entropy case 

The values oipi and S in the maximum entropy case are a direct consequence 
from the following theorem. 

Theorem 5. Let Ui,...,Un, n = 2", m == |V|(|V| - l)/2 be all possible 
undirected graphs with vertex set V and let P{Uk) — ^/n, k — l,...,n. Let 
Ci and Cj, i =/= j be two edges. Then P(e,j) = 1/2 and P{ei,ej) — 1/4. 

Proof. The number of possible configurations of an undirected graph is given 
by the Cartesian product of the possible states of its m edges, resulting in 



1(0,1} 



x{0,l}| = |{0,in=2" 



(B.l) 
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possible undirected graphs. Then edge ei is present in 2™ ^ graphs and e^ and 
Cj are simultaneously present in 2™~^ graphs. Therefore 



2"P(Wfc) "2 '''''^ ^^^e„e,j- 2mp(^^) ^ 4' 



V *V r,„n^7/ ^ o \ ^^ J) 2™P(Wfe) 4 ^ ^ 

n 



The fact that <Tij = for every i y^ j also proves that the edges are indepen- 
dent according to Theorem [l] 
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